fix a few bugs #270

BurntSushi · 2016-08-05T00:21:17Z

This fixes both #264 and #268. It also fixes an unreported bug where the DFA cache size could grow (a lot) bigger than the bound set by the caller.

Typically, when a DFA blows up in size, it happens for two reasons:

1. It accumulates many states.
2. Each state accumulates more and more NFA states.

Our previous approximation for the size of the DFA accounted for (1) but used a constant for the size of (2). This can result in very large differences (in the MBs) between the approximate and actual size of the DFA. Since computing the actual size from scratch is expensive, we instead maintain it as a running sum that is updated as states are added. The end result is that we more strictly respect the memory limit set by the caller.
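The running-sum idea can be illustrated with a minimal sketch. This is not the code from this PR: `State`, `Cache`, `approx_size`, and `add_state` are hypothetical names, and the size formula is an assumption chosen only to show a per-state cost that grows with the number of NFA states.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a DFA state: the set of NFA states it represents.
struct State {
    nfa_states: Vec<u32>,
}

impl State {
    /// Approximate footprint in bytes. It depends on how many NFA states
    /// this DFA state holds, rather than on a fixed constant.
    fn approx_size(&self) -> usize {
        size_of::<State>() + self.nfa_states.len() * size_of::<u32>()
    }
}

/// Hypothetical cache that tracks its approximate size incrementally.
struct Cache {
    states: Vec<State>,
    size: usize,  // running sum of approx_size() over all stored states
    limit: usize, // bound set by the caller, in bytes
}

impl Cache {
    fn new(limit: usize) -> Cache {
        Cache { states: Vec::new(), size: 0, limit }
    }

    /// Adds a state if it fits within the limit and returns whether it was added.
    /// The total is never recomputed from scratch; it is updated in O(1) here.
    fn add_state(&mut self, state: State) -> bool {
        let added = state.approx_size();
        if self.size + added > self.limit {
            return false; // a real engine might flush the cache or give up here
        }
        self.size += added;
        self.states.push(state);
        true
    }
}
```

With a constant per-state estimate, a state holding a thousand NFA states would be charged the same as one holding ten, which is how the approximation can drift far from the actual size.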

The specific problem here is that our literal search doesn't know about anchors, so it will try to search for all of the detected literals in a regex. In a regex like `a|^b`, the literal `b` should only be searched for at the beginning of the haystack and in no other place. The right way to fix this is probably to make the literal detector smarter, but the literal detector is already too complex. Instead, this commit detects whether a regex is partially anchored (that is, when the regex has at least one matchable sub-expression that is anchored), and if so, disables the literal engine. Note that this doesn't disable all literal optimizations, just the optimization that opts out of regex engines entirely. Both the DFA and the NFA will still use literal prefixes to search. That is, if a prefix search finds a literal that needs to be anchored but does not occur at the start of the haystack, then the regex engine rules it out as a false positive. Fixes rust-lang#268.
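A rough sketch of the "partially anchored" check, written against a toy expression type rather than the crate's real syntax tree. The names `Expr`, `has_anchored_start`, and `literal_engine_allowed` are hypothetical, and only a leading `^` in a concatenation is considered, which is a simplification.

```rust
/// Toy regex AST, assumed for illustration only.
enum Expr {
    Literal(String),
    StartText, // the `^` anchor
    Concat(Vec<Expr>),
    Alternate(Vec<Expr>),
}

/// Returns true if at least one sub-expression that can begin a match
/// is anchored at the start of the haystack.
fn has_anchored_start(e: &Expr) -> bool {
    match e {
        Expr::StartText => true,
        Expr::Literal(_) => false,
        // Simplification: only the first element of a concatenation is checked.
        Expr::Concat(es) => es.first().map_or(false, has_anchored_start),
        // One anchored branch is enough to make the regex partially anchored.
        Expr::Alternate(es) => es.iter().any(has_anchored_start),
    }
}

/// The pure literal engine is only safe when no branch carries an anchor,
/// because a plain substring search cannot enforce "only at position 0".
fn literal_engine_allowed(e: &Expr) -> bool {
    !has_anchored_start(e)
}
```

For `a|^b`, the second branch starts with `^`, so `literal_engine_allowed` returns false and the search falls back to a regex engine, which can reject a `b` found anywhere other than the start of the haystack.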

If the caller asks for captures, and the DFA runs, and there's a match, and there are actually captures in the regex, then the haystack sent to the NFA is shortened to correspond to only the match plus some room at the end for matching zero-width assertions. This "room at the end" needs to be big enough to fit at least one UTF-8 encoded Unicode codepoint. Fixes rust-lang#264.
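The shortening step might look roughly like the sketch below. The function name `shorten_for_captures` and the constant `MAX_UTF8_LEN` are hypothetical, and the choice of 4 bytes is an assumption based on the maximum length of a UTF-8 encoded codepoint; the actual amount of room used by the crate is not stated here.

```rust
/// Longest possible UTF-8 encoding of a single codepoint, in bytes.
const MAX_UTF8_LEN: usize = 4;

/// Given the end offset of a match found by the DFA, returns the prefix of
/// the haystack to hand to the NFA: the match itself plus enough trailing
/// bytes to decode one full codepoint for zero-width assertions such as `\b`.
fn shorten_for_captures(haystack: &[u8], match_end: usize) -> &[u8] {
    // Clamp to the haystack length so the slice never runs past the end.
    let end = std::cmp::min(haystack.len(), match_end.saturating_add(MAX_UTF8_LEN));
    &haystack[..end]
}
```

If the room were smaller than the codepoint that follows the match, the NFA would see a truncated byte sequence after the match and could evaluate an assertion like a word boundary differently than the DFA did on the full haystack.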

Docopt uses lazy_static! 2.x, but lazy_static required a new minimum Rust version in 2.1.

BurntSushi added 4 commits August 4, 2016 18:58

Don't build regex-debug on Rust 1.3.

16931b0

Docopt uses lazy_static! 2.x, but lazy_static required a new minimum Rust version in 2.1.

BurntSushi merged commit 11447f0 into rust-lang:master Aug 5, 2016

BurntSushi deleted the fixes branch August 5, 2016 01:06
